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Abstract. Modern graphics processors provide exceptional computa- 
tional power, but only for certain computational models. While they have 
revolutionized computation in many fields, compression has been largely 
unnaffected. This paper aims to explain the current issues and possibili- 
ties in GPGPU compression. This is done by a high level overview of the 
GPGPU computational model in the context of compression algorithms; 
along with a more in-depth analysis of how one would implement bzip2 
on a GPGPU architecture. 



1 Introduction 

Ever since PCs became common household items, and computer gaming hit 
mainstream attention during the 90s, the graphics processing unit (GPU) in- 
dustry has been growing steadily in along with the CPU industry. The need 
for faster and ever more visually stunning graphics has lead a head-on battle 
between nVidia and AMD for more powerful CPUs, and the consumers money. 
The colossal computational power on the GPU chips has until lately only been 
utilized to their full capacity in 3D gaming, but lately with the advent of new ar- 
chitectures is it possible to exploit the computational resources for more general 
purposes. 

General-purpose computing on graphics processing units (GPGPU) gives 
software designers the massive parallel computational power and memory band- 
width of modern GPUs, at the cost of more limited programming models. Some 
tasks, such as scientific computing and audio/video processing, are able to utilize 
these resources without large complications due to their paralellizable nature. 
Other tasks such as data compression, especially lossless data compression, are 
not as lucky and have more difficulties exploiting the power of GPUs. This pa- 
per focuses on lossless data compression and attempts to build a picture of the 
incompatibilities between GPGPU computation and data compression. On the 
other hand it also takes a look at what data compression related computations 
have successfully been ported to GPGPU platforms. 

The rest of the paper is split into three sections. The next section [5] describes 
the GPGPU platforms a bit closer, relating their characteristics to the needs of 
data compression. Section [3] shows some successful applications of GPGPU on 
compression algorithms, and describes what parts of a bzip2 implementations 
could be accelerated by GPGPUs. Finally in section [4] we discuss the current 
trends in the field of GPGPU compression. 



2 GPGPU 



Current leading edge GPU chips provide way higher computations per second 
compared to similarly priced leading edge CPU. The computational power of the 
GPUs is distributed over hundreds of computational cores, all working in paral- 
lel. To simplify synchronization of their concurrent computations, the GPGPU 
platforms run code in clusters of 32 threads called warps. Each of the threads in 
one warp executes the same instruction, on different data, at every given time. 
For fast performance, the data the processors operate on should be coalesced in 
continuous and aligned blocks in memory, to enable a single memory transfer for 
all data. Multiple warps are run in blocks of hundred of threads, which combine 
into a grid consisting of all executing threads. Until recently all threads would al- 
ways run the same program, but newer architectures have enabled simultaneous 
execution of multiple programs. 

This programming model fits nicely in image processing where each comput- 
ing core can process a single pixel, comparing it to nearby pixels in temporal or 
spatial dimensions, producing reductions or sensory compensations of them. On 
the other hand serial computations are slow on GPUs, as are many algorithms 
efficiently parallelizable on symmetric multiprocessing CPU systems. The lim- 
itations created by the tightly synchronized threads, and the memory access 
patterns restrict what is feasible to do on a GPGPU platform. 

The split system with the CPUs on the motherboard, and the GPUs as 
separate devices also lead to some obstacles. Firstly all data data that is to be 
processed on GPGPU needs to be copied over the PCIe bus. Although the PCIe 

2.0 bus is fast (8 GBPS) in comparison to many other transportation mediums, 
it does fall short of the parallel memory bandwidth on GPU cards (>200GBPS), 
which might lead to limits when wanting to do really fast stuff. Although the 
memory transfer from device (GPU) to host (CPU) can be performed in parallel 
with executed code, it always adds some extra latency to the overall runtime. 

2.1 GPGPU Data Compression 

The fact that lossless data compression is largely about finding redundancy 
within a set of data, leads to compression algorithms that process the data in 
one or many passes, comparing incoming data with previously encountered data. 
This makes many of them highly serial in nature, and are usually challenging 
to parallelize. The most common way of parallelizing compression algorithms is 
thus to split the data into smaller blocks and execute the normal serial algorithm 
on each one. As a downside, this separates the redundancies, giving slightly 
poorer compression ratio. There are multiple bzip2 implementations doing this 
with efficient scaling on multiple CPU cores [6j. This simple trick means that 
little research has been done on more efficiently parallelizable algorithms before 
GPGPU came along. 

The fact that GPGPUs have hundreds of processors and a non-general mem- 
ory accessing scheme, means that the method of splitting the data into smaller 
chunks isn't always feasible [16] . On the other hand if one aims at very trivial 



and simple compression by processing single fields or integers into more compact 
form, using for example null suppression or Elias delta codes, one can achieve 
amazing parallel performance |12) . 

The real deal of GPGPU compression is though to be found form exploiting 
more general computational methods available for GPGPU platforms. Firstly, 
being such a widely used general tool, sorting has several efficient parallel imple- 
mentations, such as merge sort executing 0(n log n) steps, in O(logn) time for a 
input of size n. On modern hardware it gives about a tenfold speedup compared 
to CPU implementations |13j . 

Perhaps more interesting is the parallel implementation of counting prefix 
sums. This algorithm computes intermediate values in a tree-like fashion pro- 
ducing an output with 0(n) steps, in O(logn) time and being around ^20x 
faster than CPU implementations on modern hardware . This has interesting 
utilizations in many GPGPU algorithms such as sorting [TU] and lexicographic 
naming |14) . In the next chapter where we will look into some parallel imple- 
mentations of classic compression methods, where we can see more uses for the 
prefix sum. 

3 Current compression algorithms 

This chapter looks at some widely used compression algorithms, also used in 
bzip2, to see what parts of them have been successfully parallelized on GPGPU 
platforms. Notice beforehand that although the algorithms in this chapter are 
successful at compressing the incoming data, a efficient GPGPU decompressor 
might not always be available. 

3.1 Burrows- Wheeler Transform (BWT) 

The bzip2 compression stack starts with BWT [4] to reorder the characters in 
the data to a more compressible form. Since the transform is simply executed by 
sorting the rotations of the input data, we can use any efficient sorting algorithm 
available for GPGPU. As stated above, there are many available that would 
make the encoding possible in parallel giving reasonable speedups. Although no 
research papers seem to have been published on the subject yet, preliminary 
figures in the community give claims of reasonable speedup |15l9j . 

Unfortunately the inverse Burrows- Wheeler transform isn't as simple to par- 
allelize. One possibility for parallelization would be to utilize the property of the 
transform that the inverse can be started from any point on the output. Given 
this we can on each parallel processor start at a point in the transformed string 
and continue until we hit a character from which decoding has been started by 
another processor. Problems would possibly arise from poor GPU performance 
of the very random memory accessing caused by the scattering of characters 
throughout the string. 
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Fig. 1. Run- length encoding. Empty cells are not part of the array, but are added 
to retain the straight flow of values. Thus the last three arrays are of size 3. 



3.2 Run-Length Encoding (RLE) 

Moving into the next step of the bzip2 compression, we encounter RLE, which 
compresses long runs of same characters into a single character along with a 
integer specifying the amount of times the character was repeated. Since RLE 
only focuses on very local redundancies between the characters, it is easier to 
design parallel algorithms for it. One such algorithm has been researched by W. 
Fang et al. [5], and similar work has been done by A. Balevic pQ. 

The algorithm described by Fang utilizes as a central part the parallel im- 
plementation for counting prefix sums, mentioned above. The encoding works in 
four parallel steps, as can be seen in figure [T] In the first step the boundaries 
between the runs are identified by having each processor compare its designated 
character with the neighboring one, outputting Is in an auxiliary array to mark 
the boundaries. In the second step the prefix sum of the auxiliary array is com- 
puted to get the output positions for the characters and run lengths. In step 
3 the algorithm simply copies the index of the boundary in the original data 
along with the character into the output array. The final step, executed on the 
now smaller output array, computes the lengths of the runs by comparing the 
differences of the indices. 

RLE is one of the few GPGPU compression algorithms for which a efficient 
decoding has been found and implemented. Working similarly as the encoding 
in reverse, figure [2] shows the four steps of the algorithm. First the indices of the 
boundaries are calculated from the run lengths using the prefix sum algorithm. 
Secondly Is are written into an auxiliary array at the boundaries in the previous 
result. In step 3 the prefix sum is run on the auxiliary array to compute an array 
containing indices of the compressed input. In the last step characters are copied 
to the output array according to the indices computed in the previous step. 
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Fig. 2. Run-length decoding 



3.3 Move-to- front 

Although no published research could be found on MTF encoding, there seems 
to be a possibility for trivial parallelization. By parallelizing over the stack of 
recently used symbols to find the index of the read character, we could reduce 
the time of looking up the index of the character. In practice each processor 
would check if the character at the processors index matches the input character 
and output this index to some shared register. Since each character occurs in 
the stack only once, no concurrency race conditions will occur for the write. 
Updating the state of the stack can also be similarly parallelizable. 

This would lead to constant time lookup of the index, provided that the GPU 
has equal amount or more processors than amount characters in the alphabet. 
Although every core in the GPU is easily utilized, one must take note that this 
implementation would be executing more redundant computations compared 
to a serial implementation which would stop after finding the first, and only, 
occurrence. 

Speeding up decoding on GPGPU platforms might be more challenging since 
the character lookup is already constant time on serial implementations, and 
starting decoding from multiple places is difficult since the state of the stack is 
not known at the other places. 



3.4 Variable-Length Encoding (VLE) 

The last step of bzip2 covered in this paper is the Huffman coding 8 . Although 
no GPGPU algorithms are at the time available for constructing the actual 
Huffman tree, there exists at least one algorithm by A. Balevic [2], focusing on 
the substitution from uncompressed data to variable-length encoding. Although 
the substitution seems like a trivial task for a serial algorithm, this isn't the 
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Step 1: Codeword lookup 
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Fig. 3. Variable length encoding, xxxx represents one generic uncompressed 
character, and yyy represents it's generic codeword. 



case when we want to parallelize. The problem arises from the fact that due 
to variable length codewords, we don't know where into the output array the 
converted codes should be copied. 

The encoding algorithm works as in figure [3J First the input characters are 
converted to codewords in parallel, also storing the corresponding codeword 
length into an auxiliary array. Secondly the prefix sum is calculated from the 
auxiliary array to get the output positions of the codewords. Finally the code- 
words are copied into the calculated indexes. 

Here again, decompression is harder. This is due to the fact that the decoder 
doesn't know where one codeword ends and another begins before it has decoded 
the whole prior input. Some work on parallelizing variable-length codes using 
error-resilience methods has been done [3J Ch. 3.10], and could possibly be used 
on GPGPU platforms. 

3.5 Other compression usages 

In addition to general lossless data compression focusing on large scale redun- 
dancies as in the case of bzip2, successful work has also been done in other types 
of GPGPU data compression applications. Firstly a team at the Texas state 
university has implemented a high-speed floating point compression algorithm 
called GFC [12] that touts 75GB/s compression speeds for usage in scientific 
computation where massive amounts of data are generated and need to be pro- 
cessed in real time. The compression methods used are quite simple, mostly 
focusing on only storing deltas between floating point values and removing un- 
necessary zeroes form the result. Unfortunately, the practical processing speed 
of the compression is limited by the PCIe bus to 8GB/s. 

Other work has been done in database data compression [5], where multiple 
different light-weight compression methods are combined to speed up GPU co- 
processing of database queries. Compressing the data makes it faster to move 
the data over the PCIe bus, and faster to read from disk. The compression 
techniques focus on compression of single database fields such as integers, using 
null-suppression, dictionaries, RLE and more. 



4 Discussion 



The compression algorithms described above compose the central parts of bzip2. 
Given proper implementations of each, it shouldn't be hard to combine them 
togther for a bzip2 compatible compressing tool that could deliver high perfor- 
mance on modern GPUs. The efficiency of the MTF algorithm is debatable. But 
the benefit of having all code running on the GPU, removing unnecessary data 
transfer over the PCIe bus, might still make the overall compression speed faster. 

As for the codeword tables for the VLE, it seems unlikely that efficient 
Huffman-tree GPGPU algorithms will be possible. But this doesn't mean that 
pre-calculated ones couldn't be used, combined with some data analysis for 
choosing a fitting one for the current input data. 

Although most parallel compression methods depend on splitting the com- 
pressible data into small chunks, cutting the dependancies between the chunks, 
we have here taken a look at some more sophisticated attempts at parallel com- 
pression on GPGPU platforms. GPGPU compression focusing on local changes 
can give very high compression band widths, while at the same time more wide 
scale algorithms are using general computational primitives for efficient paral- 
lelizable implementations. 

Future work will most likely be affected by the different general algorithms 
discovered for GPGPU platforms, and made easily available by libraries such as 
Thrust [7]. 
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